Over the past few years, concern about the environmental impact of animal agriculture has grown, making it increasingly important to understand how our food choices affect the planet. By analyzing data on the environmental footprint of different types of food, we can gain valuable insight into which products take the greatest toll.
These concerns are the driving force behind this analysis. Animal agriculture is a significant contributor to greenhouse gas emissions: it is a major source of methane and nitrous oxide, two especially potent greenhouse gases. It also uses large amounts of water and land.
To carry out this analysis, we will use Pandas for data manipulation and analysis, Seaborn and Matplotlib for data visualization and custom plots, SciPy for statistical tests, and scikit-learn for machine learning.
This tutorial will guide users through a typical data science pipeline while also explaining why animal agriculture's impact on the environment is a pressing problem.
Initially, we must collect relevant data from the internet to conduct our analysis:
We will download each respective CSV file and then use Pandas to read and manipulate them accordingly to set up for data management.
import pandas as pd # Import Pandas library
CSV1 = 'Food_Production.csv' # Initialize CSV file containing environmental impact of food production data
FOOD_PRODUCTS = ["Bread", "Maize", "Barley",
"Oatmeal", "Rice", "Potato",
"Cassava", "Sugar", "Beet Sugar",
"Pulse", "Pea", "Nut",
"Groundnut", "Soymilk", "Tofu",
"Soybean Oil", "Palm Oil", "Sunflower Oil",
"Rapeseed Oil", "Olive Oil", "Tomato",
"Onion & Leek", "Root Veg.", "Brassica",
"Veg.", "Citrus Fruit", "Banana",
"Apple", "Berry & Grape", "Wine",
"Fruit", "Coffee", "Dark Chocolate",
"Beef", "Dairy Beef", "Lamb",
"Pig", "Poultry", "Milk",
"Cheese", "Egg", "Fish", "Shrimp"] # Create list of abbreviations for clearer graph labels (current ones are too long)
df1 = pd.read_csv(CSV1) # Read CSV file and store as a Pandas dataframe
df1.loc[:, "Food product"] = FOOD_PRODUCTS # Replace the values in the "Food product" column with the abbreviations
df1.head() # Print the first few entries of the dataframe
| | Food product | Land use change | Animal Feed | Farm | Processing | Transport | Packging | Retail | Total_emissions | Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal) | ... | Freshwater withdrawals per 100g protein (liters per 100g protein) | Freshwater withdrawals per kilogram (liters per kilogram) | Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal) | Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein) | Land use per 1000kcal (m² per 1000kcal) | Land use per kilogram (m² per kilogram) | Land use per 100g protein (m² per 100g protein) | Scarcity-weighted water use per kilogram (liters per kilogram) | Scarcity-weighted water use per 100g protein (liters per 100g protein) | Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bread | 0.1 | 0.0 | 0.8 | 0.2 | 0.1 | 0.1 | 0.1 | 1.4 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | Maize | 0.3 | 0.0 | 0.5 | 0.1 | 0.1 | 0.1 | 0.0 | 1.1 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Barley | 0.0 | 0.0 | 0.2 | 0.1 | 0.0 | 0.5 | 0.3 | 1.1 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | Oatmeal | 0.0 | 0.0 | 1.4 | 0.0 | 0.1 | 0.1 | 0.0 | 1.6 | 4.281357 | ... | 371.076923 | 482.4 | 0.945482 | 1.907692 | 2.897446 | 7.6 | 5.846154 | 18786.2 | 14450.92308 | 7162.104461 |
| 4 | Rice | 0.0 | 0.0 | 3.6 | 0.1 | 0.1 | 0.1 | 0.1 | 4.0 | 9.514379 | ... | 3166.760563 | 2248.4 | 1.207271 | 6.267606 | 0.759631 | 2.8 | 3.943662 | 49576.3 | 69825.77465 | 13449.891480 |
5 rows × 23 columns
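A caveat on the assignment above: replacing a column with a plain Python list only works when the list length matches the number of rows, so it is worth checking before assigning. A minimal sketch with a toy frame and labels (stand-ins, not the real dataset):

```python
import pandas as pd

# Toy frame standing in for df1; assigning a plain list to a column
# raises ValueError unless the lengths match, so verify first.
df = pd.DataFrame({"Food product": ["Wheat & Rye Bread", "White Rice"]})
labels = ["Bread", "Rice"]
assert len(labels) == len(df), "label list must match row count"
df.loc[:, "Food product"] = labels
```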
Now that we have created our first dataframe containing the first dataset, let's create the next one.
CSV2 = 'meat_consumption_worldwide.csv' # Initialize CSV file containing meat consumption worldwide data
df2 = pd.read_csv(CSV2) # Read CSV file and store as a Pandas dataframe
df2.drop(df2.loc[df2["LOCATION"] == "WLD"].index, inplace = True) # Remove WLD (global) from the dataframe, since we only want to look at individual countries
df2.head() # Print the first few entries of the dataframe
| | LOCATION | SUBJECT | MEASURE | TIME | Value |
|---|---|---|---|---|---|
| 0 | AUS | BEEF | KG_CAP | 1991 | 27.721815 |
| 1 | AUS | BEEF | KG_CAP | 1992 | 26.199591 |
| 2 | AUS | BEEF | KG_CAP | 1993 | 26.169094 |
| 3 | AUS | BEEF | KG_CAP | 1994 | 25.456134 |
| 4 | AUS | BEEF | KG_CAP | 1995 | 25.340226 |
We are done reading our data, so we can continue to the next stage where we will deal with the NaN values.
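Before dropping anything, it helps to see where the missing values actually are. A minimal sketch that counts NaNs per column, using a toy frame in place of df1 (column names chosen to mirror the real ones):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df1: isna().sum() counts missing
# values per column, showing where the per-unit metrics are absent.
toy = pd.DataFrame({
    "Food product": ["Bread", "Oatmeal", "Rice"],
    "Total_emissions": [1.4, 1.6, 4.0],
    "Land use per kilogram (m² per kilogram)": [np.nan, 7.6, 2.8],
})
missing_counts = toy.isna().sum()
```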
To make this data easy to visualize with Seaborn, we will reshape it into new dataframes that can be graphed directly.
We will split df1 into two dataframes: df3 and df4. df3 will contain the first 9 columns, which break each product's total emissions down by supply-chain stage. df4 will contain the remaining columns, which report impacts per unit (per 1000 kcal, per kilogram, or per 100 g of protein). The distinction will become clearer when we graph the data.
END = 9 # Set the end range
df3 = df1.iloc[:, :END] # Keep the first 9 columns and store them as df3
df3.head() # Print the first few entries of the dataframe
| | Food product | Land use change | Animal Feed | Farm | Processing | Transport | Packging | Retail | Total_emissions |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Bread | 0.1 | 0.0 | 0.8 | 0.2 | 0.1 | 0.1 | 0.1 | 1.4 |
| 1 | Maize | 0.3 | 0.0 | 0.5 | 0.1 | 0.1 | 0.1 | 0.0 | 1.1 |
| 2 | Barley | 0.0 | 0.0 | 0.2 | 0.1 | 0.0 | 0.5 | 0.3 | 1.1 |
| 3 | Oatmeal | 0.0 | 0.0 | 1.4 | 0.0 | 0.1 | 0.1 | 0.0 | 1.6 |
| 4 | Rice | 0.0 | 0.0 | 3.6 | 0.1 | 0.1 | 0.1 | 0.1 | 4.0 |
START = 9 # Set the start range
END = 23 # Set the end range
RANGE = [0] + list(range(START, END)) # Set the range of columns we want
df4 = df1.copy() # Create a copy to prevent warning
df4 = df4.iloc[:, RANGE] # Keep the columns after the 9th
df4.dropna(inplace = True) # Remove all NaNs since they don't provide any data
df4.reset_index(drop = True, inplace = True) # Reset our index for clarity
df4.head() # Print the first few entries of the dataframe
| | Food product | Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal) | Eutrophying emissions per kilogram (gPO₄eq per kilogram) | Eutrophying emissions per 100g protein (gPO₄eq per 100 grams protein) | Freshwater withdrawals per 1000kcal (liters per 1000kcal) | Freshwater withdrawals per 100g protein (liters per 100g protein) | Freshwater withdrawals per kilogram (liters per kilogram) | Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal) | Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein) | Land use per 1000kcal (m² per 1000kcal) | Land use per kilogram (m² per kilogram) | Land use per 100g protein (m² per 100g protein) | Scarcity-weighted water use per kilogram (liters per kilogram) | Scarcity-weighted water use per 100g protein (liters per 100g protein) | Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Oatmeal | 4.281357 | 11.23 | 8.638462 | 183.911552 | 371.076923 | 482.4 | 0.945482 | 1.907692 | 2.897446 | 7.60 | 5.846154 | 18786.2 | 14450.92308 | 7162.104461 |
| 1 | Rice | 9.514379 | 35.07 | 49.394366 | 609.983722 | 3166.760563 | 2248.4 | 1.207271 | 6.267606 | 0.759631 | 2.80 | 3.943662 | 49576.3 | 69825.77465 | 13449.891480 |
| 2 | Potato | 4.754098 | 3.48 | 20.470588 | 80.737705 | 347.647059 | 59.1 | 0.628415 | 2.705882 | 1.202186 | 0.88 | 5.176471 | 2754.2 | 16201.17647 | 3762.568306 |
| 3 | Nut | 3.113821 | 19.15 | 11.726883 | 672.162602 | 2531.414574 | 4133.8 | 0.069919 | 0.263319 | 2.107317 | 12.96 | 7.936314 | 229889.8 | 140777.58730 | 37380.455280 |
| 4 | Groundnut | 2.437931 | 14.14 | 5.401070 | 319.362069 | 707.524828 | 1852.3 | 0.556897 | 1.233766 | 1.570690 | 9.11 | 3.479756 | 61797.9 | 23605.00382 | 10654.810340 |
PLANT_PRODUCTS = {"Bread", "Maize", "Barley",
"Oatmeal", "Rice", "Potato",
"Cassava", "Sugar", "Beet Sugar",
"Pulse", "Pea", "Nut",
"Groundnut", "Soymilk", "Tofu",
"Soybean Oil", "Palm Oil", "Sunflower Oil",
"Rapeseed Oil", "Olive Oil", "Tomato",
"Onion & Leek", "Root Veg.", "Brassica",
"Veg.", "Citrus Fruit", "Banana",
"Apple", "Berry & Grape", "Wine",
"Fruit", "Coffee", "Dark Chocolate"}
ANIMAL_PRODUCTS = {"Beef", "Dairy Beef", "Lamb",
"Pig", "Poultry", "Milk",
"Cheese", "Egg", "Fish", "Shrimp"}
df5 = df3.copy()
df6 = df4.copy()
df5['Food type'] = None
df6['Food type'] = None
for i, r in df5.iterrows():
if r['Food product'] in PLANT_PRODUCTS:
df5.loc[i, 'Food type'] = 'Plant'
elif r['Food product'] in ANIMAL_PRODUCTS:
df5.loc[i, 'Food type'] = 'Animal'
for i, r in df6.iterrows():
if r['Food product'] in PLANT_PRODUCTS:
df6.loc[i, 'Food type'] = 'Plant'
elif r['Food product'] in ANIMAL_PRODUCTS:
df6.loc[i, 'Food type'] = 'Animal'
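As an aside, the two row-by-row loops above can also be written as a single vectorized pass with `isin()` and `np.where()`, which is faster and more idiomatic for large frames. This sketch uses toy product sets rather than the full lists:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for PLANT_PRODUCTS / ANIMAL_PRODUCTS
PLANT = {"Bread", "Rice"}
ANIMAL = {"Beef", "Egg"}

toy = pd.DataFrame({"Food product": ["Bread", "Beef", "Rice"]})
# isin() + nested np.where() labels every row at once,
# leaving None for products in neither set
toy["Food type"] = np.where(toy["Food product"].isin(PLANT), "Plant",
                   np.where(toy["Food product"].isin(ANIMAL), "Animal", None))
```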
df5 = df5.groupby("Food type").mean(numeric_only = True) # Average the numeric columns per food type (numeric_only avoids errors on the string "Food product" column in recent Pandas versions)
df5["Food type"] = df5.index
df5.reset_index(drop = True, inplace = True)
df5
| | Land use change | Animal Feed | Farm | Processing | Transport | Packging | Retail | Total_emissions | Food type |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.810000 | 1.95 | 10.490000 | 0.500000 | 0.240000 | 0.220000 | 0.180000 | 16.390000 | Animal |
| 1 | 0.790909 | 0.00 | 1.342424 | 0.178788 | 0.181818 | 0.284848 | 0.036364 | 2.815152 | Plant |
df6 = df6.groupby("Food type").mean(numeric_only = True) # Average the numeric columns per food type
df6["Food type"] = df6.index
df6.reset_index(drop = True, inplace = True)
df6
| | Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal) | Eutrophying emissions per kilogram (gPO₄eq per kilogram) | Eutrophying emissions per 100g protein (gPO₄eq per 100 grams protein) | Freshwater withdrawals per 1000kcal (liters per 1000kcal) | Freshwater withdrawals per 100g protein (liters per 100g protein) | Freshwater withdrawals per kilogram (liters per kilogram) | Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal) | Greenhouse gas emissions per 100g protein (kgCO₂eq per 100g protein) | Land use per 1000kcal (m² per 1000kcal) | Land use per kilogram (m² per kilogram) | Land use per 100g protein (m² per 100g protein) | Scarcity-weighted water use per kilogram (liters per kilogram) | Scarcity-weighted water use per 100g protein (liters per 100g protein) | Scarcity-weighted water use per 1000kcal (liters per 1000 kilocalories) | Food type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58.085221 | 139.423333 | 73.289922 | 906.822192 | 1230.689663 | 2102.944444 | 10.436802 | 14.490724 | 34.723544 | 97.806667 | 51.590835 | 70855.533333 | 41650.170546 | 28049.055284 | Animal |
| 1 | 23.109872 | 20.742667 | 49.746990 | 422.777740 | 1728.610704 | 711.420000 | 5.741551 | 14.587478 | 5.421294 | 8.788667 | 19.385191 | 28073.860000 | 76078.624420 | 15746.276693 | Plant |
Next, we will split df2 into three dataframes: df7, df8, and df9. df7 will contain 2022 data on relative meat consumption per country. df8 will contain data on how each country's meat consumption changes every year. df9 will contain data on how all countries' meat consumption changes every year.
df7 = df2.loc[df2["TIME"] == 2022] # Filter to contain only 2022 data and store it as df7
df7.reset_index(drop = True, inplace = True) # Reset our index for clarity
df7.head() # Print the first few entries of the dataframe
| | LOCATION | SUBJECT | MEASURE | TIME | Value |
|---|---|---|---|---|---|
| 0 | AUS | BEEF | KG_CAP | 2022 | 19.951395 |
| 1 | AUS | PIG | KG_CAP | 2022 | 20.795147 |
| 2 | AUS | POULTRY | KG_CAP | 2022 | 45.974985 |
| 3 | AUS | SHEEP | KG_CAP | 2022 | 8.606749 |
| 4 | CAN | BEEF | KG_CAP | 2022 | 18.426918 |
We should group the data by location since we want to observe each country's overall meat consumption, averaged across meat categories, in our visualization.
df7 = df7.groupby("LOCATION").mean(numeric_only = True) # Group the data by location and average the numeric columns (the string columns SUBJECT and MEASURE are excluded)
df7["LOCATION"] = df7.index # Add "LOCATION" back as a column (after it's removed by groupby)
df7.reset_index(drop = True, inplace = True) # Reset the index and remove the "LOCATION" index
df7.drop(columns = ["TIME"], inplace = True) # Drop "TIME" since we're only looking at 2022
df7.head() # Print the first few entries of the dataframe
| | Value | LOCATION |
|---|---|---|
| 0 | 654.438777 | ARG |
| 1 | 395.792480 | AUS |
| 2 | 92.761958 | BGD |
| 3 | 2742.902830 | BRA |
| 4 | 16958.468271 | BRICS |
Since we aim to visualize this data, we need to come up with a solid range of values to separate relative meat consumption into categories so we can clearly see which countries consume more meat than others.
Let's sort this data and view it all.
df7.sort_values("Value")
| | Value | LOCATION |
|---|---|---|
| 15 | 28.316855 | HTI |
| 24 | 40.967186 | MOZ |
| 14 | 48.032780 | GHA |
| 46 | 48.279836 | ZMB |
| 42 | 58.287311 | URY |
| 28 | 63.785447 | NZL |
| 33 | 72.225519 | PRY |
| 40 | 77.438105 | TZA |
| 12 | 84.938557 | ETH |
| 2 | 92.761958 | BGD |
| 27 | 97.187500 | NOR |
| 19 | 119.053703 | ISR |
| 10 | 119.396307 | DZA |
| 36 | 119.653849 | SDN |
| 21 | 138.128745 | KAZ |
| 6 | 149.544583 | CHE |
| 26 | 199.659620 | NGA |
| 7 | 224.346100 | CHL |
| 38 | 247.684181 | THA |
| 31 | 247.995524 | PER |
| 25 | 279.695257 | MYS |
| 35 | 286.645753 | SAU |
| 41 | 304.334280 | UKR |
| 11 | 331.591675 | EGY |
| 9 | 357.508500 | COL |
| 18 | 378.662937 | IRN |
| 39 | 395.481716 | TUR |
| 1 | 395.792480 | AUS |
| 5 | 421.909062 | CAN |
| 45 | 443.961610 | ZAF |
| 30 | 451.054780 | PAK |
| 22 | 473.225893 | KOR |
| 16 | 498.350871 | IDN |
| 32 | 524.246670 | PHL |
| 0 | 654.438777 | ARG |
| 17 | 709.425892 | IND |
| 20 | 720.685672 | JPN |
| 44 | 893.342928 | VNM |
| 23 | 1044.307995 | MEX |
| 34 | 1393.345149 | RUS |
| 37 | 1468.064750 | SSA |
| 3 | 2742.902830 | BRA |
| 43 | 5193.676467 | USA |
| 13 | 5500.365924 | EU28 |
| 8 | 11695.816209 | CHN |
| 29 | 14593.153328 | OECD |
| 4 | 16958.468271 | BRICS |
From this, we can intuitively suggest the following range values: 0, 100, 500, 1000, 5000, 10000, and 20000.
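These threshold values can be sanity-checked with `pandas.cut`, which assigns each value to a bin. The sample values and the category labels below are illustrative, not part of the dataset:

```python
import pandas as pd

# Bin edges proposed above; labels are hypothetical names for the bins
BINS = [0, 100, 500, 1000, 5000, 10000, 20000]
LABELS = ["very low", "low", "medium", "high", "very high", "extreme"]

# A few sample consumption values spanning the range
values = pd.Series([28.3, 395.8, 654.4, 2742.9, 16958.5])
categories = pd.cut(values, bins = BINS, labels = LABELS)
```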
Next, we can create df8 and df9 accordingly.
df8 = df2.groupby(["LOCATION", "TIME"]).mean(numeric_only = True) # Average meat consumption per country per year
df8.reset_index(inplace = True)
df8.head()
| | LOCATION | TIME | Value |
|---|---|---|---|
| 0 | ARG | 1990 | 524.747862 |
| 1 | ARG | 1991 | 393.885581 |
| 2 | ARG | 1992 | 420.682693 |
| 3 | ARG | 1993 | 447.727484 |
| 4 | ARG | 1994 | 436.872628 |
df9 = df2.groupby("TIME").mean(numeric_only = True) # Average meat consumption across all countries per year
df9.reset_index(inplace = True)
df9.head()
| | TIME | Value |
|---|---|---|
| 0 | 1990 | 506.459977 |
| 1 | 1991 | 616.501783 |
| 2 | 1992 | 689.963810 |
| 3 | 1993 | 794.275811 |
| 4 | 1994 | 837.991369 |
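Since df9 tracks how average consumption changes from year to year, `pct_change` gives a quick view of the growth rate. The numbers below are toy values standing in for the real yearly averages:

```python
import pandas as pd

# Toy yearly averages standing in for df9; pct_change computes the
# fractional change from the previous row, i.e. year-over-year growth
yearly = pd.DataFrame({"TIME": [1990, 1991, 1992],
                       "Value": [500.0, 600.0, 690.0]})
yearly["YoY %"] = yearly["Value"].pct_change() * 100
```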
import seaborn as sns # Import Seaborn for statistical plotting
import matplotlib.pyplot as plt # Import Matplotlib for figure and axes handling
sns.set(rc = {'figure.figsize': (25, 25)})
_, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2)
p1 = sns.pointplot(x = "Food product", y = "Land use change", color = "red", data = df3, ax = ax1)
p1.set(xlabel = "FOOD PRODUCT", ylabel = "LAND USE CHANGE")
p1.set_xticklabels(p1.get_xticklabels(), rotation = 90)
p2 = sns.pointplot(x = "Food product", y = "Animal Feed", color = "green", data = df3, ax = ax2)
p2.set(xlabel = "FOOD PRODUCT", ylabel = "ANIMAL FEED")
p2.set_xticklabels(p2.get_xticklabels(), rotation = 90)
p3 = sns.pointplot(x = "Food product", y = "Transport", color = "blue", data = df3, ax = ax3)
p3.set(xlabel = "FOOD PRODUCT", ylabel = "TRANSPORT")
p3.set_xticklabels(p3.get_xticklabels(), rotation = 90)
p4 = sns.pointplot(x = "Food product", y = "Total_emissions", color = "yellow", data = df3, ax = ax4)
p4.set(xlabel = "FOOD PRODUCT", ylabel = "TOTAL EMISSIONS")
p4.set_xticklabels(p4.get_xticklabels(), rotation = 90)
plt.show() # Render the 2x2 grid of point plots
_, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows = 2, ncols = 2)
p1 = sns.barplot(x = "Food product", y = "Eutrophying emissions per 1000kcal (gPO₄eq per 1000kcal)", data = df4, ax = ax1)
p1.set(xlabel = "FOOD PRODUCT", ylabel = "EUTROPHYING EMISSIONS (gPO₄eq per 1000kcal)")
p1.set_xticklabels(p1.get_xticklabels(), rotation = 90)
p2 = sns.barplot(x = "Food product", y = "Freshwater withdrawals per 1000kcal (liters per 1000kcal)", data = df4, ax = ax2)
p2.set(xlabel = "FOOD PRODUCT", ylabel = "WATER USE (liters per 1000kcal)")
p2.set_xticklabels(p2.get_xticklabels(), rotation = 90)
p3 = sns.barplot(x = "Food product", y = "Greenhouse gas emissions per 1000kcal (kgCO₂eq per 1000kcal)", data = df4, ax = ax3)
p3.set(xlabel = "FOOD PRODUCT", ylabel = "GREENHOUSE GAS EMISSIONS (kgCO₂eq per 1000kcal)")
p3.set_xticklabels(p3.get_xticklabels(), rotation = 90)
p4 = sns.barplot(x = "Food product", y = "Land use per 1000kcal (m² per 1000kcal)", data = df4, ax = ax4)
p4.set(xlabel = "FOOD PRODUCT", ylabel = "LAND USE (m² per 1000kcal)")
p4.set_xticklabels(p4.get_xticklabels(), rotation = 90)
plt.show() # Render the 2x2 grid of bar plots
import folium
from branca.element import Figure
fig = Figure(width = 1250, height = 750)
world_map = folium.Map(location = [42.5, 0], zoom_start = 2)
folium.Choropleth(
geo_data = 'https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson',
data = df7,
columns = ['LOCATION', 'Value'],
key_on = 'feature.properties.ISO_A3', # Bind on the ISO 3166-1 alpha-3 code (Folium requires key_on paths to start with "feature")
fill_color = 'YlOrRd',
legend_name = "Relative Meat Consumption",
threshold_scale = [0, 100, 500, 1000, 5000, 10000, 20000]
).add_to(world_map)
fig.add_child(world_map)
fig